home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
The PC-SIG Library 10
/
The PC-Sig Library - Shareware for the IBM PC and Compatibles (PC-SIG)(Tenth Edition Disks 1-2804)(1991).iso
/
PC_SIGCD
/
07
/
3
/
DISK0731.ZIP
/
CAFR
next >
Wrap
Text File
|
1987-02-09
|
11KB
|
320 lines
Date: January 25, 1987
For more information:
James E. Tarvid
Logical Support
State Road 700 at Peach Bottom Creek
Route 2 Box 73
Independence, Virginia 24348
(703) 773-3419
-----------------------------------------------------------------------------
FILE RETRIEVAL USING INVERTED BIT VECTORS - James E. Tarvid
ABSTRACT
This paper describes a computer software technology for rapid retrieval
of information from unstructured data such as collections of letters,
resumes, abstracts, reports, or depositions residing on a mass storage device
such as diskettes, hard disks, and compact disks using compact indices. The
method uses a two step process of indexing and retrieval. The index
structure employed is a high density inverted bit matrix indexed by word and
file. A prototype for demonstrating the technology is available for personal
computers.
INTRODUCTION
Most of the keystrokes eventually stored on computer systems comprise
tokens or more simply words of identity, quality and quantity. The
information contained therein depends on both the form and the content of the
information structures and their interpretation by the end user. The
expansion of computer usage has increased that proportion of stored
information which is unstructured while the evolution of computer technology
counters with ever greater yet still primitive understanding of information
structures themselves. In spite of the advances in "artificial
intelligence", we are years away from adequately automatically translating
raw text into information structures. Only a small amount of the information
entered on computers is available in a form suitable for retrieval by "fourth
generation languages" (4GL). In the mean time, there is a need to support
human activity in retrieving information from that vast bulk of unstructured
information.
INFORMATION STRUCTURE ASSUMPTIONS
Unstructured information such as letters, depositions, resumes, reports
and abstracts are generally stored as text files. From the primitive
computer point of view, these files consists of one long string of
characters. Without too much loss of information or functionality, we can
consider the files to consist of words separated by punctuation and blank
space. Obviously, the information contained depends on the juxtaposition of
qualifiers and quantifiers, but for the purposes of the technology described
herein, we shall be content with the notion that text is a sequence of words
with no discernible significance in their order.
COST FACTORS
In computer parlance, we are predicating a structure of entities, which
are files, and attributes, which are words (independent of the position of
words within the file). More naturally we are talking about an index of
files by word. Obviously, a more sophisticated structure is conceivable but
we are responding here to economic needs and constraints of space and time.
The index consists of entity-attribute couples which have as items a word and
a file which contains that word. We can produce such an index by scanning
files, producing file-word couples, sorting on word, eliminating duplicates
and merging. Given such an index, it is easy to see how to quickly return a
list of files containing a particular word. It is also easy to see that such
an index would be much larger than the text itself.
INFORMATION COMPRESSION
Natural text is frequently compressed by the use of abbreviations. We
commonly accept "USA" for the "United States of America" or "MD" for "medical
doctor. Some technical compression efforts are natural and easy. For
example, the identifier for a file can be reduced to a number if a
corresponding structure maps that number back into a file name. That is we
can use the number "5" for the file "xyz" if the fifth slot of a table
contains "xyz". Now we can represent the index as a dictionary with a list
of file numbers. The size of the dictionary is still rather large but
efficiency improves as the number of files increases and in some cases the
costs of the dictionary may well be within value of the index. Still to be
generally useful, a much smaller index is needed.
Less naturally, words can be converted into numbers by treating the
characters as mathematical operands. At the simplest we can treat the word
as a large binary integer. For text encoded in ASCII, the American Standard
Code for Information Interchange", the range of values produced extends from
65 for the binary bit pattern for the word "A" into the billions for binary
bit patterns for short words such as "name". The range can be compressed by
restricting the domain to the 26 letters ignoring case, but the range of
radix 26 number is still large and consequently not very dense. Fortunately,
a technique known as hashing has been around for decades. Hashing words
consists of converting the letters making up a word to a number and then
"folding" the number into a limited range by dividing by a constant and
taking the remainder. The hash code may be thought of as a mathematical
abbreviation. There is a loss of specificity since two or more words may
produce the same remainder much as an abbreviation may stand for two
different words ("Dr" for "doctor" and "drive"). Such words are called
"scramble synonyms". The loss of specificity, while troublesome in some
regards, permits a much greater density of information in the index.
At this point, the problem being addressed can be reduced a square
matrix of code, file number pairs. Both of the matrix indices are dense and
can be represented as a "bit" matrix where the row index is the hash codes
and the column index is the file number and a bit indicates which files
contain a word which generate the corresponding hash code.
PROGRAM LOGIC
Producing the index and using it become a simple matter.
Indexing consists of reading each file to be indexed, hashing each word,
and setting the bit in a vector corresponding to the hash code. Each file
produces a separate vector. The set of vectors comprises a matrix which is
transposed and written to disk. If the matrix is too large to fit in memory
successive transposed matrices are merged. In a hierarchical directory
system, the scope of files to be indexed can be presumed to be a directory
node, its files, and all its sub nodes and their files. Unwanted files may
be classified by extension or type and indexing of common words suppressed.
Incremental indexing can be provided on file attributes such as "date".
Retrieval consists of hashing each word in a "keyword list", retrieving
the rows corresponding to the hash codes, "and-ing" the rows together,
searching the row for set bits, and converting the index to a file name.
Other boolean operators in the retrieval language such as "or" for synonyms
are easy to implement.
SCRAMBLE SYNONYM PROBLEM
As noted before, the price of using a hash code to compress the index is
that two or more words may have the same code. A retrieval specifying a
single word is likely to include names of files which do not in fact contain
the word. First, it should be noted that every file containing the specified
word is returned by a retrieval. Second, the use of multiple keywords in a
retrieval greatly diminishes the number of "false hits". Third, a post
retrieval scan could be made. In fact, using inverted bit matrices before
doing more complicated searches could greatly improve their efficiency as
well.
APPLICATIONS
While the scope of applications amenable to this technique is extremely
broad, it may be helpful to consider a few applications.
Matching skills with requirements is common in large diverse
organizations. Typically a list of commonly required skills is drawn up and
each individual's skills are surveyed. The individual-skill record is
inverted (sorted or indexed) on skill. From such an index, a list of
individuals having a particular skill is easily produced. Several problems
can arise. First, the survey may not include the requisite skills. It is
difficult to anticipate all the skills that might be required in an
organization. Second, a fair amount of effort is required to prepare the
skills matrix and maintain it, hence it is likely to be less than complete or
current in its contents. Now consider the case where the employment
histories are on disk along with project summaries. After generating an
index and periodically adding new files, a current list of individuals that
may have the necessary attributes can be generated instantly.
Word processing centers in organizations may produce thousands of
documents. Typically, these documents are discarded or archived on diskette,
which is tantamount to the same thing. A typical letter contains about 2000
characters and costs about $2 to type. The cost of permanent disk storage
for such a letter is less than one penny yet virtually no one could can make
a general query such as "has anyone corresponded with John Smith in Nigeria
on the use of time controlled valves in drip irrigation?". The inverted
index makes such a query and even more obscure queries simple.
Corporate and institutional libraries have proven invaluable in support
of professional activity. Eventually, much of the material is indexed or
abstracted, and in the hands of a skilled librarian the needed references may
make it into the hands of users but probably not in time for critical needs.
Now consider the situation in which staff members are encouraged to make
brief comments about significant material they encounter in routine
operations. Collected on disk and indexed say once per week, any
professional could instantly generate a list of references along with the
judgments of their associates.
Real estate brokers have combined with their colleagues through
"multiple listing services" to match buyers with properties. Although the
information management services provided are expensive and generally
cumbersome, they provide little more information than property descriptions,
buyer profiles, a PC with a hard disk, a word processing program, and an
inverted index would provide at a fraction of the cost.
PROTOTYPE
A prototype of the system is available for MS-DOS and PC-DOS compatible
systems at nominal cost. Firms, institutions, and individuals wishing to
incorporate the technology in their systems and operations should write or
call for details.